The Global Health Observatory (GHO) data repository of the World Health Organization (WHO) tracks health status and many related factors for all countries. The datasets are made publicly available for health data analysis. The data on life expectancy and health factors for 193 countries were collected from the WHO data repository website, and the corresponding economic data were collected from the United Nations website. Among all categories of health-related factors, only the critical factors that are most representative were chosen. Over the past 15 years there has been substantial development in the health sector, resulting in improved human mortality rates, especially in developing nations, compared with the past 30 years. Therefore, this project considers data from 2000 to 2015 for 193 countries.
The individual data files were merged into a single dataset. Initial visual inspection of the data showed some missing values. As the datasets came from WHO, no evident errors were found. Missing data were inspected in R using the missmap command. The result indicated that most of the missing data were for Population, Hepatitis B and GDP. The missing data came from less well-known countries such as Vanuatu, Tonga, Togo and Cabo Verde. Finding all the data for these countries was difficult, so it was decided to exclude them from the final model dataset.
The final merged file (the final dataset) consists of 22 columns and 2938 rows, which means 20 predicting variables. The predicting variables were then divided into several broad categories: immunization-related factors, mortality factors, economic factors and social factors.
This dataset was sourced from Kaggle. Source link: https://www.kaggle.com/kumarajarshi/life-expectancy-who
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.1.1
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.4 v dplyr 1.0.7
## v tidyr 1.1.3 v stringr 1.4.0
## v readr 2.0.1 v forcats 0.5.1
## Warning: package 'ggplot2' was built under R version 4.1.1
## Warning: package 'tibble' was built under R version 4.1.1
## Warning: package 'tidyr' was built under R version 4.1.1
## Warning: package 'readr' was built under R version 4.1.1
## Warning: package 'purrr' was built under R version 4.1.1
## Warning: package 'dplyr' was built under R version 4.1.1
## Warning: package 'stringr' was built under R version 4.1.1
## Warning: package 'forcats' was built under R version 4.1.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(plotly)
## Warning: package 'plotly' was built under R version 4.1.1
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
#Reading csv file
life_expectancy_data <- read.csv("C:/Users/shah3sw/OneDrive - University of Cincinnati/Data_Analysis_Method_Project/Life Expectancy Data.csv")
head(life_expectancy_data)
## Country Year Status Life.expectancy Adult.Mortality infant.deaths
## 1 Afghanistan 2015 Developing 65.0 263 62
## 2 Afghanistan 2014 Developing 59.9 271 64
## 3 Afghanistan 2013 Developing 59.9 268 66
## 4 Afghanistan 2012 Developing 59.5 272 69
## 5 Afghanistan 2011 Developing 59.2 275 71
## 6 Afghanistan 2010 Developing 58.8 279 74
## Alcohol percentage.expenditure Hepatitis.B Measles BMI under.five.deaths
## 1 0.01 71.279624 65 1154 19.1 83
## 2 0.01 73.523582 62 492 18.6 86
## 3 0.01 73.219243 64 430 18.1 89
## 4 0.01 78.184215 67 2787 17.6 93
## 5 0.01 7.097109 68 3013 17.2 97
## 6 0.01 79.679367 66 1989 16.7 102
## Polio Total.expenditure Diphtheria HIV.AIDS GDP Population
## 1 6 8.16 65 0.1 584.25921 33736494
## 2 58 8.18 62 0.1 612.69651 327582
## 3 62 8.13 64 0.1 631.74498 31731688
## 4 67 8.52 67 0.1 669.95900 3696958
## 5 68 7.87 68 0.1 63.53723 2978599
## 6 66 9.20 66 0.1 553.32894 2883167
## thinness..1.19.years thinness.5.9.years Income.composition.of.resources
## 1 17.2 17.3 0.479
## 2 17.5 17.5 0.476
## 3 17.7 17.7 0.470
## 4 17.9 18.0 0.463
## 5 18.2 18.2 0.454
## 6 18.4 18.4 0.448
## Schooling
## 1 10.1
## 2 10.0
## 3 9.9
## 4 9.8
## 5 9.5
## 6 9.2
#Dimensions : Gives numbers of rows and columns
dim(life_expectancy_data)
## [1] 2938 22
# Structure of dataset
str(life_expectancy_data)
## 'data.frame': 2938 obs. of 22 variables:
## $ Country : chr "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
## $ Year : int 2015 2014 2013 2012 2011 2010 2009 2008 2007 2006 ...
## $ Status : chr "Developing" "Developing" "Developing" "Developing" ...
## $ Life.expectancy : num 65 59.9 59.9 59.5 59.2 58.8 58.6 58.1 57.5 57.3 ...
## $ Adult.Mortality : int 263 271 268 272 275 279 281 287 295 295 ...
## $ infant.deaths : int 62 64 66 69 71 74 77 80 82 84 ...
## $ Alcohol : num 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.03 0.02 0.03 ...
## $ percentage.expenditure : num 71.3 73.5 73.2 78.2 7.1 ...
## $ Hepatitis.B : int 65 62 64 67 68 66 63 64 63 64 ...
## $ Measles : int 1154 492 430 2787 3013 1989 2861 1599 1141 1990 ...
## $ BMI : num 19.1 18.6 18.1 17.6 17.2 16.7 16.2 15.7 15.2 14.7 ...
## $ under.five.deaths : int 83 86 89 93 97 102 106 110 113 116 ...
## $ Polio : int 6 58 62 67 68 66 63 64 63 58 ...
## $ Total.expenditure : num 8.16 8.18 8.13 8.52 7.87 9.2 9.42 8.33 6.73 7.43 ...
## $ Diphtheria : int 65 62 64 67 68 66 63 64 63 58 ...
## $ HIV.AIDS : num 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 ...
## $ GDP : num 584.3 612.7 631.7 670 63.5 ...
## $ Population : num 33736494 327582 31731688 3696958 2978599 ...
## $ thinness..1.19.years : num 17.2 17.5 17.7 17.9 18.2 18.4 18.6 18.8 19 19.2 ...
## $ thinness.5.9.years : num 17.3 17.5 17.7 18 18.2 18.4 18.7 18.9 19.1 19.3 ...
## $ Income.composition.of.resources: num 0.479 0.476 0.47 0.463 0.454 0.448 0.434 0.433 0.415 0.405 ...
## $ Schooling : num 10.1 10 9.9 9.8 9.5 9.2 8.9 8.7 8.4 8.1 ...
#Summary
summary(life_expectancy_data)
## Country Year Status Life.expectancy
## Length:2938 Min. :2000 Length:2938 Min. :36.30
## Class :character 1st Qu.:2004 Class :character 1st Qu.:63.10
## Mode :character Median :2008 Mode :character Median :72.10
## Mean :2008 Mean :69.22
## 3rd Qu.:2012 3rd Qu.:75.70
## Max. :2015 Max. :89.00
## NA's :10
## Adult.Mortality infant.deaths Alcohol percentage.expenditure
## Min. : 1.0 Min. : 0.0 Min. : 0.0100 Min. : 0.000
## 1st Qu.: 74.0 1st Qu.: 0.0 1st Qu.: 0.8775 1st Qu.: 4.685
## Median :144.0 Median : 3.0 Median : 3.7550 Median : 64.913
## Mean :164.8 Mean : 30.3 Mean : 4.6029 Mean : 738.251
## 3rd Qu.:228.0 3rd Qu.: 22.0 3rd Qu.: 7.7025 3rd Qu.: 441.534
## Max. :723.0 Max. :1800.0 Max. :17.8700 Max. :19479.912
## NA's :10 NA's :194
## Hepatitis.B Measles BMI under.five.deaths
## Min. : 1.00 Min. : 0.0 Min. : 1.00 Min. : 0.00
## 1st Qu.:77.00 1st Qu.: 0.0 1st Qu.:19.30 1st Qu.: 0.00
## Median :92.00 Median : 17.0 Median :43.50 Median : 4.00
## Mean :80.94 Mean : 2419.6 Mean :38.32 Mean : 42.04
## 3rd Qu.:97.00 3rd Qu.: 360.2 3rd Qu.:56.20 3rd Qu.: 28.00
## Max. :99.00 Max. :212183.0 Max. :87.30 Max. :2500.00
## NA's :553 NA's :34
## Polio Total.expenditure Diphtheria HIV.AIDS
## Min. : 3.00 Min. : 0.370 Min. : 2.00 Min. : 0.100
## 1st Qu.:78.00 1st Qu.: 4.260 1st Qu.:78.00 1st Qu.: 0.100
## Median :93.00 Median : 5.755 Median :93.00 Median : 0.100
## Mean :82.55 Mean : 5.938 Mean :82.32 Mean : 1.742
## 3rd Qu.:97.00 3rd Qu.: 7.492 3rd Qu.:97.00 3rd Qu.: 0.800
## Max. :99.00 Max. :17.600 Max. :99.00 Max. :50.600
## NA's :19 NA's :226 NA's :19
## GDP Population thinness..1.19.years
## Min. : 1.68 Min. :3.400e+01 Min. : 0.10
## 1st Qu.: 463.94 1st Qu.:1.958e+05 1st Qu.: 1.60
## Median : 1766.95 Median :1.387e+06 Median : 3.30
## Mean : 7483.16 Mean :1.275e+07 Mean : 4.84
## 3rd Qu.: 5910.81 3rd Qu.:7.420e+06 3rd Qu.: 7.20
## Max. :119172.74 Max. :1.294e+09 Max. :27.70
## NA's :448 NA's :652 NA's :34
## thinness.5.9.years Income.composition.of.resources Schooling
## Min. : 0.10 Min. :0.0000 Min. : 0.00
## 1st Qu.: 1.50 1st Qu.:0.4930 1st Qu.:10.10
## Median : 3.30 Median :0.6770 Median :12.30
## Mean : 4.87 Mean :0.6276 Mean :11.99
## 3rd Qu.: 7.20 3rd Qu.:0.7790 3rd Qu.:14.30
## Max. :28.60 Max. :0.9480 Max. :20.70
## NA's :34 NA's :167 NA's :163
#Check for missing values
colSums(is.na(life_expectancy_data))
## Country Year
## 0 0
## Status Life.expectancy
## 0 10
## Adult.Mortality infant.deaths
## 10 0
## Alcohol percentage.expenditure
## 194 0
## Hepatitis.B Measles
## 553 0
## BMI under.five.deaths
## 34 0
## Polio Total.expenditure
## 19 226
## Diphtheria HIV.AIDS
## 19 0
## GDP Population
## 448 652
## thinness..1.19.years thinness.5.9.years
## 34 34
## Income.composition.of.resources Schooling
## 167 163
The counts above show how many values are missing in each variable.
We will replace the missing values in the numeric columns with the column mean so that they do not cause errors in our analysis.
# Select numeric variables for calculating mean
life_expectancy_data_num <- select(life_expectancy_data,-c(1,2,3))
#Calculate means of all the numeric variables
colMeans(life_expectancy_data_num, na.rm = TRUE)
## Life.expectancy Adult.Mortality
## 6.922493e+01 1.647964e+02
## infant.deaths Alcohol
## 3.030395e+01 4.602861e+00
## percentage.expenditure Hepatitis.B
## 7.382513e+02 8.094046e+01
## Measles BMI
## 2.419592e+03 3.832125e+01
## under.five.deaths Polio
## 4.203574e+01 8.255019e+01
## Total.expenditure Diphtheria
## 5.938190e+00 8.232408e+01
## HIV.AIDS GDP
## 1.742103e+00 7.483158e+03
## Population thinness..1.19.years
## 1.275338e+07 4.839704e+00
## thinness.5.9.years Income.composition.of.resources
## 4.870317e+00 6.275511e-01
## Schooling
## 1.199279e+01
# Impute missing values in numeric variables with mean
for(i in 4:ncol(life_expectancy_data)) {
life_expectancy_data[ , i][is.na(life_expectancy_data[ , i])] <- mean(life_expectancy_data[ , i], na.rm=TRUE)
}
summary(life_expectancy_data)
## Country Year Status Life.expectancy
## Length:2938 Min. :2000 Length:2938 Min. :36.30
## Class :character 1st Qu.:2004 Class :character 1st Qu.:63.20
## Mode :character Median :2008 Mode :character Median :72.00
## Mean :2008 Mean :69.22
## 3rd Qu.:2012 3rd Qu.:75.60
## Max. :2015 Max. :89.00
## Adult.Mortality infant.deaths Alcohol percentage.expenditure
## Min. : 1.0 Min. : 0.0 Min. : 0.010 Min. : 0.000
## 1st Qu.: 74.0 1st Qu.: 0.0 1st Qu.: 1.093 1st Qu.: 4.685
## Median :144.0 Median : 3.0 Median : 4.160 Median : 64.913
## Mean :164.8 Mean : 30.3 Mean : 4.603 Mean : 738.251
## 3rd Qu.:227.0 3rd Qu.: 22.0 3rd Qu.: 7.390 3rd Qu.: 441.534
## Max. :723.0 Max. :1800.0 Max. :17.870 Max. :19479.912
## Hepatitis.B Measles BMI under.five.deaths
## Min. : 1.00 Min. : 0.0 Min. : 1.00 Min. : 0.00
## 1st Qu.:80.94 1st Qu.: 0.0 1st Qu.:19.40 1st Qu.: 0.00
## Median :87.00 Median : 17.0 Median :43.00 Median : 4.00
## Mean :80.94 Mean : 2419.6 Mean :38.32 Mean : 42.04
## 3rd Qu.:96.00 3rd Qu.: 360.2 3rd Qu.:56.10 3rd Qu.: 28.00
## Max. :99.00 Max. :212183.0 Max. :87.30 Max. :2500.00
## Polio Total.expenditure Diphtheria HIV.AIDS
## Min. : 3.00 Min. : 0.370 Min. : 2.00 Min. : 0.100
## 1st Qu.:78.00 1st Qu.: 4.370 1st Qu.:78.00 1st Qu.: 0.100
## Median :93.00 Median : 5.938 Median :93.00 Median : 0.100
## Mean :82.55 Mean : 5.938 Mean :82.32 Mean : 1.742
## 3rd Qu.:97.00 3rd Qu.: 7.330 3rd Qu.:97.00 3rd Qu.: 0.800
## Max. :99.00 Max. :17.600 Max. :99.00 Max. :50.600
## GDP Population thinness..1.19.years
## Min. : 1.68 Min. :3.400e+01 Min. : 0.10
## 1st Qu.: 580.49 1st Qu.:4.189e+05 1st Qu.: 1.60
## Median : 3116.56 Median :3.676e+06 Median : 3.40
## Mean : 7483.16 Mean :1.275e+07 Mean : 4.84
## 3rd Qu.: 7483.16 3rd Qu.:1.275e+07 3rd Qu.: 7.10
## Max. :119172.74 Max. :1.294e+09 Max. :27.70
## thinness.5.9.years Income.composition.of.resources Schooling
## Min. : 0.10 Min. :0.0000 Min. : 0.00
## 1st Qu.: 1.60 1st Qu.:0.5042 1st Qu.:10.30
## Median : 3.40 Median :0.6620 Median :12.10
## Mean : 4.87 Mean :0.6276 Mean :11.99
## 3rd Qu.: 7.20 3rd Qu.:0.7720 3rd Qu.:14.10
## Max. :28.60 Max. :0.9480 Max. :20.70
# We can see that now the data set has no missing values
colSums(is.na(life_expectancy_data))
## Country Year
## 0 0
## Status Life.expectancy
## 0 0
## Adult.Mortality infant.deaths
## 0 0
## Alcohol percentage.expenditure
## 0 0
## Hepatitis.B Measles
## 0 0
## BMI under.five.deaths
## 0 0
## Polio Total.expenditure
## 0 0
## Diphtheria HIV.AIDS
## 0 0
## GDP Population
## 0 0
## thinness..1.19.years thinness.5.9.years
## 0 0
## Income.composition.of.resources Schooling
## 0 0
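The mean imputation used above can also be written without an explicit loop over column indices. The following is a minimal sketch on a made-up data frame; `impute_mean` and `toy` are hypothetical names introduced for illustration and are not part of the original analysis.

```r
# Toy data frame standing in for the real data (values are made up)
toy <- data.frame(
  Country = c("A", "B", "C", "D"),
  GDP     = c(100, NA, 300, 500),
  Alcohol = c(1.5, 2.5, NA, NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: replace NA in every numeric column with that column's mean
impute_mean <- function(df) {
  is_num <- vapply(df, is.numeric, logical(1))
  df[is_num] <- lapply(df[is_num], function(x) {
    x[is.na(x)] <- mean(x, na.rm = TRUE)
    x
  })
  df
}

toy_imputed <- impute_mean(toy)
colSums(is.na(toy_imputed))
```

Selecting columns by type rather than by position keeps the character columns untouched even if the column order changes. One caveat of mean imputation is visible in the summary above: after imputation the third quartile of GDP equals its mean (7483.16), because so many values were filled with the same number.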
When predicting life expectancy, there may be a few outliers that we need to set aside.
#Plotting box plots of life expectancy to understand outliers
boxplot(life_expectancy_data$Life.expectancy, xlab="Life Expectancy")
From the box plot we can see that life expectancy values below about 45 years are outliers. Our analysis does not apply to these records.
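The outliers that the box plot draws as individual points can also be listed numerically. The sketch below uses made-up values rather than the WHO data, and relies on `boxplot.stats()`, which applies the same 1.5 * IQR whisker rule that `boxplot()` uses by default.

```r
# Made-up life expectancy values (years), not taken from the data set
life_exp <- c(36.3, 44.5, 63, 65, 68, 70, 72, 73, 75, 76, 78, 80)

# boxplot.stats() returns the five-number summary and the points beyond the whiskers
stats <- boxplot.stats(life_exp)
outliers <- stats$out

# Keep only the records that fall inside the whiskers
life_exp_trimmed <- life_exp[!life_exp %in% outliers]
outliers
```

Applied to `life_expectancy_data$Life.expectancy`, the same call would give the exact cut-off instead of a value read off the plot.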
Now we will perform linear regression to identify how each factor relates to life expectancy.
Let’s start with the field “Percentage Expenditure”.
Percentage Expenditure represents expenditure on health as a percentage of Gross Domestic Product per capita (%).
First, let’s find the correlation between Percentage Expenditure and Life Expectancy.
# Correlation between life expectancy and percentage expenditure
cor(life_expectancy_data$Life.expectancy, life_expectancy_data$percentage.expenditure)
## [1] 0.3817912
model_per_expenditure <- lm(percentage.expenditure ~ Life.expectancy, life_expectancy_data)
summary(model_per_expenditure)
##
## Call:
## lm(formula = percentage.expenditure ~ Life.expectancy, data = life_expectancy_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2252.1 -940.3 -433.9 274.3 17626.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4787.782 249.204 -19.21 <2e-16 ***
## Life.expectancy 79.827 3.566 22.38 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1838 on 2936 degrees of freedom
## Multiple R-squared: 0.1458, Adjusted R-squared: 0.1455
## F-statistic: 501 on 1 and 2936 DF, p-value: < 2.2e-16
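The value of 0.3817912 indicates a moderate positive correlation between percentage expenditure and life expectancy. The estimated slope is statistically significant, as the associated p-value is less than 0.05, but the R-squared of 0.1458 shows that this single variable accounts for only about 15% of the variance.
Note that the model above was fitted with percentage expenditure as the outcome and life expectancy as the predictor. The slope of 79.827 therefore means that each additional year of life expectancy is associated with an increase of about 79.8 units in percentage expenditure; it does not mean that life expectancy rises by 79.827 years per unit of expenditure. To estimate the change in life expectancy per unit of expenditure, the formula would need to be Life.expectancy ~ percentage.expenditure.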
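The direction of the formula in `lm()` determines what the slope means. The sketch below uses simulated data (not the WHO data) to show that swapping outcome and predictor leaves R-squared unchanged but gives a very different slope; all variable names here are illustrative.

```r
set.seed(1)
# Simulated data, not the WHO data: y depends linearly on x plus noise
n <- 200
x <- rnorm(n, mean = 700, sd = 300)
y <- 60 + 0.01 * x + rnorm(n, sd = 5)

fit_y_on_x <- lm(y ~ x)   # outcome y explained by x
fit_x_on_y <- lm(x ~ y)   # roles swapped

r <- cor(x, y)
slope_y_on_x <- unname(coef(fit_y_on_x)["x"])
slope_x_on_y <- unname(coef(fit_x_on_y)["y"])

# Both fits share the same R-squared, which equals the squared correlation
c(r_squared_y_on_x   = summary(fit_y_on_x)$r.squared,
  r_squared_x_on_y   = summary(fit_x_on_y)$r.squared,
  r_squared_from_cor = r^2)

# The slopes differ; their product equals the squared correlation
c(slope_y_on_x = slope_y_on_x, slope_x_on_y = slope_x_on_y)
```

This is why the R-squared of 0.1458 reported above equals 0.3817912 squared regardless of which variable is placed on the left of the formula, while the slope has to be read in the units of the outcome that was actually fitted.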
library(plotly)
life_expectancy_vs_percenntage_expenditure <- plot_ly(data = life_expectancy_data, x = ~percentage.expenditure, y = ~Life.expectancy,
marker = list(size = 10,
color = 'rgba(0, 255, 127, .9)',
line = list(color = 'rgba(255, 0, 38, 0.2)',
width = 2)))
life_expectancy_vs_percenntage_expenditure <- life_expectancy_vs_percenntage_expenditure %>% layout(title = 'Scatter Plot: Life Expectancy vs Percentage Expenditure',
yaxis = list(zeroline = FALSE),
xaxis = list(zeroline = FALSE))
life_expectancy_vs_percenntage_expenditure
## No trace type specified:
## Based on info supplied, a 'scatter' trace seems appropriate.
## Read more about this trace type -> https://plotly.com/r/reference/#scatter
## No scatter mode specifed:
## Setting the mode to markers
## Read more about this attribute -> https://plotly.com/r/reference/#scatter-mode
Similar Analysis could be done for other variables.
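One way to repeat the same simple regression for several variables is to loop over the predictor names and collect the correlation, slope and R-squared in a single table. This is a sketch on simulated data; the column names mirror the real ones, but the values and the `results` table are illustrative only.

```r
set.seed(42)
# Simulated stand-in for the data set (values are made up)
n <- 100
toy <- data.frame(
  BMI   = runif(n, 15, 60),
  Polio = runif(n, 50, 99)
)
toy$Life.expectancy <- 50 + 0.2 * toy$BMI + 0.1 * toy$Polio + rnorm(n, sd = 3)

predictors <- setdiff(names(toy), "Life.expectancy")

# Fit one simple regression per predictor and collect the key numbers
results <- do.call(rbind, lapply(predictors, function(p) {
  fit <- lm(reformulate(p, response = "Life.expectancy"), data = toy)
  data.frame(
    predictor   = p,
    correlation = cor(toy[[p]], toy$Life.expectancy),
    slope       = unname(coef(fit)[p]),
    r_squared   = summary(fit)$r.squared
  )
}))
results
```

Here life expectancy is always the outcome, so each slope is expressed in years of life expectancy per unit of the predictor.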
Hepatitis B represents Hepatitis B (HepB) immunization coverage among 1-year-olds (%)
library(plotly)
life_expectancy_vs_Hepatitis_B <- plot_ly(data = life_expectancy_data, x = ~Hepatitis.B, y = ~Life.expectancy,
marker = list(size = 10,
color = 'rgba(0,255,0, .9)',
line = list(color = 'rgba(255, 0, 38, 0.2)',
width = 2)))
life_expectancy_vs_Hepatitis_B <- life_expectancy_vs_Hepatitis_B %>% layout(title = 'Scatter Plot: Life Expectancy vs Hepatitis B',
yaxis = list(zeroline = FALSE),
xaxis = list(zeroline = FALSE))
life_expectancy_vs_Hepatitis_B
## No trace type specified:
## Based on info supplied, a 'scatter' trace seems appropriate.
## Read more about this trace type -> https://plotly.com/r/reference/#scatter
## No scatter mode specifed:
## Setting the mode to markers
## Read more about this attribute -> https://plotly.com/r/reference/#scatter-mode
Measles represents the number of reported cases per 1000 population
library(plotly)
life_expectancy_vs_Measles <- plot_ly(data = life_expectancy_data, x = ~Measles , y = ~Life.expectancy,
marker = list(size = 10,
color = 'rgba(221,160,221, .9)',
line = list(color = 'rgba(255, 0, 38, 0.2)',
width = 2)))
life_expectancy_vs_Measles <- life_expectancy_vs_Measles %>% layout(title = 'Scatter Plot: Life Expectancy vs Measles',
yaxis = list(zeroline = FALSE),
xaxis = list(zeroline = FALSE))
life_expectancy_vs_Measles
## No trace type specified:
## Based on info supplied, a 'scatter' trace seems appropriate.
## Read more about this trace type -> https://plotly.com/r/reference/#scatter
## No scatter mode specifed:
## Setting the mode to markers
## Read more about this attribute -> https://plotly.com/r/reference/#scatter-mode
BMI represents average Body Mass Index of entire population
library(plotly)
life_expectancy_vs_BMI <- plot_ly(data = life_expectancy_data, x = ~BMI, y = ~Life.expectancy,
marker = list(size = 10,
color = 'rgba(255,182,193, .9)',
line = list(color = 'rgba(255, 0, 38, 0.2)',
width = 2)))
life_expectancy_vs_BMI <- life_expectancy_vs_BMI %>% layout(title = 'Scatter Plot: Life Expectancy vs BMI',
yaxis = list(zeroline = FALSE),
xaxis = list(zeroline = FALSE))
life_expectancy_vs_BMI
## No trace type specified:
## Based on info supplied, a 'scatter' trace seems appropriate.
## Read more about this trace type -> https://plotly.com/r/reference/#scatter
## No scatter mode specifed:
## Setting the mode to markers
## Read more about this attribute -> https://plotly.com/r/reference/#scatter-mode
Under five deaths represents the number of under-five deaths per 1000 population
library(plotly)
life_expectancy_vs_under_five_deaths <- plot_ly(data = life_expectancy_data, x = ~under.five.deaths , y = ~Life.expectancy,
marker = list(size = 10,
color = 'rgba(152,251,152, .9)',
line = list(color = 'rgba(255, 0, 38, 0.2)',
width = 2)))
life_expectancy_vs_under_five_deaths <- life_expectancy_vs_under_five_deaths %>% layout(title = 'Scatter Plot: Life Expectancy vs Under five deaths',
yaxis = list(zeroline = FALSE),
xaxis = list(zeroline = FALSE))
life_expectancy_vs_under_five_deaths
## No trace type specified:
## Based on info supplied, a 'scatter' trace seems appropriate.
## Read more about this trace type -> https://plotly.com/r/reference/#scatter
## No scatter mode specifed:
## Setting the mode to markers
## Read more about this attribute -> https://plotly.com/r/reference/#scatter-mode
Polio represents Polio (Pol3) immunization coverage among 1-year-olds (%)
library(plotly)
life_expectancy_vs_Polio <- plot_ly(data = life_expectancy_data, x = ~Polio , y = ~Life.expectancy,
marker = list(size = 10,
color = 'rgba(255,0,255, .9)',
line = list(color = 'rgba(255, 0, 38, 0.2)',
width = 2)))
life_expectancy_vs_Polio <- life_expectancy_vs_Polio %>% layout(title = 'Scatter Plot: Life Expectancy vs Polio',
yaxis = list(zeroline = FALSE),
xaxis = list(zeroline = FALSE))
life_expectancy_vs_Polio
## No trace type specified:
## Based on info supplied, a 'scatter' trace seems appropriate.
## Read more about this trace type -> https://plotly.com/r/reference/#scatter
## No scatter mode specifed:
## Setting the mode to markers
## Read more about this attribute -> https://plotly.com/r/reference/#scatter-mode
Total expenditure represents general government expenditure on health as a percentage of total government expenditure (%)
library(plotly)
life_expectancy_vs_Total_expenditure <- plot_ly(data = life_expectancy_data, x = ~Total.expenditure , y = ~Life.expectancy,
marker = list(size = 10,
color = 'rgba(30,144,255, .9)',
line = list(color = 'rgba(255, 0, 38, 0.2)',
width = 2)))
life_expectancy_vs_Total_expenditure <- life_expectancy_vs_Total_expenditure %>% layout(title = 'Scatter Plot: Life Expectancy vs Total expenditure',
yaxis = list(zeroline = FALSE),
xaxis = list(zeroline = FALSE))
life_expectancy_vs_Total_expenditure
## No trace type specified:
## Based on info supplied, a 'scatter' trace seems appropriate.
## Read more about this trace type -> https://plotly.com/r/reference/#scatter
## No scatter mode specifed:
## Setting the mode to markers
## Read more about this attribute -> https://plotly.com/r/reference/#scatter-mode
Diphtheria represents Diphtheria tetanus toxoid and pertussis (DTP3) immunization coverage among 1-year-olds (%)
library(plotly)
life_expectancy_vs_Diphtheria <- plot_ly(data = life_expectancy_data, x = ~Diphtheria , y = ~Life.expectancy,
marker = list(size = 10,
color = 'rgba(0, 255, 127, .9)',
line = list(color = 'rgba(255, 0, 38, 0.2)',
width = 2)))
life_expectancy_vs_Diphtheria <- life_expectancy_vs_Diphtheria %>% layout(title = 'Scatter Plot: Life Expectancy vs Diphtheria ',
yaxis = list(zeroline = FALSE),
xaxis = list(zeroline = FALSE))
life_expectancy_vs_Diphtheria
## No trace type specified:
## Based on info supplied, a 'scatter' trace seems appropriate.
## Read more about this trace type -> https://plotly.com/r/reference/#scatter
## No scatter mode specifed:
## Setting the mode to markers
## Read more about this attribute -> https://plotly.com/r/reference/#scatter-mode
Thinness 1 to 19 years represents the prevalence of thinness among children and adolescents for age 10 to 19 (%)
library(plotly)
life_expectancy_vs_thinness_1_19_years <- plot_ly(data = life_expectancy_data, x = ~thinness..1.19.years , y = ~Life.expectancy,
marker = list(size = 10,
color = 'rgba(129, 216, 210, .9)',
line = list(color = 'rgba(255, 0, 38, 0.2)',
width = 2)))
life_expectancy_vs_thinness_1_19_years <- life_expectancy_vs_thinness_1_19_years %>% layout(title = 'Scatter Plot: Life Expectancy vs Thinness 1 to 19 years',
yaxis = list(zeroline = FALSE),
xaxis = list(zeroline = FALSE))
life_expectancy_vs_thinness_1_19_years
## No trace type specified:
## Based on info supplied, a 'scatter' trace seems appropriate.
## Read more about this trace type -> https://plotly.com/r/reference/#scatter
## No scatter mode specifed:
## Setting the mode to markers
## Read more about this attribute -> https://plotly.com/r/reference/#scatter-mode
Thinness 5 to 9 years represents the prevalence of thinness among children for age 5 to 9 (%)
library(plotly)
life_expectancy_vs_thinness_5_9_years <- plot_ly(data = life_expectancy_data, x = ~thinness.5.9.years , y = ~Life.expectancy,
marker = list(size = 10,
color = 'rgba(181, 201, 253, .9)',
line = list(color = 'rgba(255, 0, 38, 0.2)',
width = 2)))
life_expectancy_vs_thinness_5_9_years <- life_expectancy_vs_thinness_5_9_years %>% layout(title = 'Scatter Plot: Life Expectancy vs Thinness 5 to 9 years',
yaxis = list(zeroline = FALSE),
xaxis = list(zeroline = FALSE))
life_expectancy_vs_thinness_5_9_years
## No trace type specified:
## Based on info supplied, a 'scatter' trace seems appropriate.
## Read more about this trace type -> https://plotly.com/r/reference/#scatter
## No scatter mode specifed:
## Setting the mode to markers
## Read more about this attribute -> https://plotly.com/r/reference/#scatter-mode
Income composition of resources represents the Human Development Index in terms of income composition of resources (index ranging from 0 to 1)
library(plotly)
life_expectancy_vs_Income_composition_of_resources <- plot_ly(data = life_expectancy_data, x = ~Income.composition.of.resources , y = ~Life.expectancy,
marker = list(size = 10,
color = 'rgba(181, 201, 253, .9)',
line = list(color = 'rgba(255, 0, 38, 0.2)',
width = 2)))
life_expectancy_vs_Income_composition_of_resources <- life_expectancy_vs_Income_composition_of_resources %>% layout(title = 'Scatter Plot: Life Expectancy vs Income composition of resources',
yaxis = list(zeroline = FALSE),
xaxis = list(zeroline = FALSE))
life_expectancy_vs_Income_composition_of_resources
## No trace type specified:
## Based on info supplied, a 'scatter' trace seems appropriate.
## Read more about this trace type -> https://plotly.com/r/reference/#scatter
## No scatter mode specifed:
## Setting the mode to markers
## Read more about this attribute -> https://plotly.com/r/reference/#scatter-mode
GDP represents Gross Domestic Product per capita (in USD)
library(plotly)
life_expectancy_vs_GDP <- plot_ly(data = life_expectancy_data, x = ~GDP , y = ~Life.expectancy,
marker = list(size = 10,
color = 'rgba(152, 215, 182, .9)',
line = list(color = 'rgba(255, 0, 38, 0.2)',
width = 2)))
life_expectancy_vs_GDP <- life_expectancy_vs_GDP %>% layout(title = 'Scatter Plot: Life Expectancy vs GDP ',
yaxis = list(zeroline = FALSE),
xaxis = list(zeroline = FALSE))
life_expectancy_vs_GDP
## No trace type specified:
## Based on info supplied, a 'scatter' trace seems appropriate.
## Read more about this trace type -> https://plotly.com/r/reference/#scatter
## No scatter mode specifed:
## Setting the mode to markers
## Read more about this attribute -> https://plotly.com/r/reference/#scatter-mode
Alcohol represents recorded per capita (15+) alcohol consumption (in litres of pure alcohol)
library(plotly)
life_expectancy_vs_Alcohol <- plot_ly(data = life_expectancy_data, x = ~Alcohol , y = ~Life.expectancy,
marker = list(size = 10,
color = 'rgba(152, 215, 182, .9)',
line = list(color = 'rgba(0, 0, 0, 0)',
width = 2)))
life_expectancy_vs_Alcohol <- life_expectancy_vs_Alcohol %>% layout(title = 'Scatter Plot: Life Expectancy vs Alcohol ',
yaxis = list(zeroline = FALSE),
xaxis = list(zeroline = FALSE))
life_expectancy_vs_Alcohol
## No trace type specified:
## Based on info supplied, a 'scatter' trace seems appropriate.
## Read more about this trace type -> https://plotly.com/r/reference/#scatter
## No scatter mode specifed:
## Setting the mode to markers
## Read more about this attribute -> https://plotly.com/r/reference/#scatter-mode
We have now seen how simple linear regression relates Life Expectancy to a single independent variable. Let’s analyse the data set using multiple linear regression.
Multiple linear regression is an extension of simple linear regression used to predict an outcome variable (y) on the basis of multiple distinct predictor variables (x).
With three predictor variables (x), the prediction of y is expressed by the following equation:
y = b0 + b1*x1 + b2*x2 + b3*x3
The “b” values are called the regression weights (or beta coefficients). They measure the association between the predictor variable and the outcome. “b_j” can be interpreted as the average effect on y of a one unit increase in “x_j”, holding all other predictors fixed.
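To make the equation concrete, the sketch below fits a three-predictor model on simulated data and checks that a prediction computed by hand from the fitted coefficients matches `predict()`. The data and the names `x1`, `x2`, `x3`, `fit` and `new_obs` are illustrative assumptions, not part of the life expectancy analysis.

```r
set.seed(7)
# Simulated data with known coefficients (2, 1.5, -0.5, 0.25)
n  <- 50
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- rnorm(n)
y  <- 2 + 1.5 * x1 - 0.5 * x2 + 0.25 * x3 + rnorm(n, sd = 0.1)

fit <- lm(y ~ x1 + x2 + x3)
b   <- coef(fit)   # b0 (intercept), b1, b2, b3

# Prediction for one new observation, computed two ways
new_obs <- data.frame(x1 = 1, x2 = 2, x3 = 3)
manual  <- unname(b["(Intercept)"] + b["x1"] * 1 + b["x2"] * 2 + b["x3"] * 3)
auto    <- unname(predict(fit, newdata = new_obs))

c(manual = manual, predict = auto)
```

Increasing `x1` by one unit while leaving `x2` and `x3` unchanged moves the prediction by exactly `b["x1"]`, which is the "holding all other predictors fixed" interpretation.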
library(tidyverse)
model <- lm(Life.expectancy ~ Alcohol + percentage.expenditure + Hepatitis.B + Measles + BMI + under.five.deaths + Polio+ Total.expenditure + Diphtheria + thinness..1.19.years + thinness.5.9.years + Income.composition.of.resources, data = life_expectancy_data_num)
summary(model)
##
## Call:
## lm(formula = Life.expectancy ~ Alcohol + percentage.expenditure +
## Hepatitis.B + Measles + BMI + under.five.deaths + Polio +
## Total.expenditure + Diphtheria + thinness..1.19.years + thinness.5.9.years +
## Income.composition.of.resources, data = life_expectancy_data_num)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.1522 -2.5539 0.4574 2.7976 21.8127
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.842e+01 7.987e-01 60.627 < 2e-16 ***
## Alcohol -3.403e-02 3.628e-02 -0.938 0.348262
## percentage.expenditure 7.292e-04 7.953e-05 9.169 < 2e-16 ***
## Hepatitis.B 3.656e-03 6.144e-03 0.595 0.551916
## Measles 2.044e-05 1.566e-05 1.305 0.191941
## BMI 7.449e-02 7.662e-03 9.722 < 2e-16 ***
## under.five.deaths -1.000e-03 1.087e-03 -0.920 0.357907
## Polio 3.495e-02 7.241e-03 4.828 1.48e-06 ***
## Total.expenditure -1.576e-02 5.429e-02 -0.290 0.771592
## Diphtheria 3.017e-02 8.057e-03 3.744 0.000186 ***
## thinness..1.19.years -4.666e-02 7.979e-02 -0.585 0.558793
## thinness.5.9.years -1.410e-01 7.844e-02 -1.798 0.072391 .
## Income.composition.of.resources 2.095e+01 8.466e-01 24.749 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.463 on 2070 degrees of freedom
## (855 observations deleted due to missingness)
## Multiple R-squared: 0.5704, Adjusted R-squared: 0.5679
## F-statistic: 229 on 12 and 2070 DF, p-value: < 2.2e-16
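Note that this model was fitted on life_expectancy_data_num, which was created before the mean imputation, so lm() dropped the 855 rows that still contained missing values.
The first step in interpreting the multiple regression analysis is to examine the F-statistic and the associated p-value, at the bottom of the model summary.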
In our example, it can be seen that p-value of the F-statistic is < 2.2e-16, which is highly significant. This means that, at least, one of the predictor variables is significantly related to the outcome variable.
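The p-value of the overall F-test can be recomputed from the F-statistic and its degrees of freedom with `pf()`. The sketch below plugs in the rounded figures printed in the summary (F = 229 on 12 and 2070 degrees of freedom); with a fitted model the same three numbers are available in `summary(model)$fstatistic`.

```r
# Figures copied from the model summary above (rounded as printed)
f_value <- 229
df1     <- 12     # number of predictors
df2     <- 2070   # residual degrees of freedom

# Upper-tail probability of the F distribution
p_value <- pf(f_value, df1, df2, lower.tail = FALSE)

# R prints "< 2.2e-16" when the p-value is below machine precision
p_value < 2.2e-16
```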
To see which predictor variables are significant, you can examine the coefficients table, which shows the estimates of the regression beta coefficients and the associated t-statistic p-values:
summary(model)$coefficient
## Estimate Std. Error t value
## (Intercept) 48.4212274295 7.986680e-01 60.6274772
## Alcohol -0.0340324872 3.627505e-02 -0.9381790
## percentage.expenditure 0.0007291457 7.952571e-05 9.1686792
## Hepatitis.B 0.0036557413 6.144224e-03 0.5949883
## Measles 0.0000204357 1.565612e-05 1.3052846
## BMI 0.0744931387 7.662231e-03 9.7221217
## under.five.deaths -0.0009999939 1.087463e-03 -0.9195656
## Polio 0.0349547554 7.240603e-03 4.8276027
## Total.expenditure -0.0157614840 5.428787e-02 -0.2903316
## Diphtheria 0.0301699912 8.057180e-03 3.7444851
## thinness..1.19.years -0.0466556368 7.978999e-02 -0.5847305
## thinness.5.9.years -0.1410030973 7.844090e-02 -1.7975712
## Income.composition.of.resources 20.9521138649 8.465872e-01 24.7489149
## Pr(>|t|)
## (Intercept) 0.000000e+00
## Alcohol 3.482619e-01
## percentage.expenditure 1.121298e-19
## Hepatitis.B 5.519163e-01
## Measles 1.919410e-01
## BMI 7.070793e-22
## under.five.deaths 3.579069e-01
## Polio 1.482979e-06
## Total.expenditure 7.715916e-01
## Diphtheria 1.856993e-04
## thinness..1.19.years 5.587927e-01
## thinness.5.9.years 7.239068e-02
## Income.composition.of.resources 1.130122e-118
For a given predictor, the t-statistic evaluates whether there is a significant association between the predictor and the outcome variable, that is, whether the beta coefficient of the predictor is significantly different from zero.
At the 0.05 level, percentage expenditure, BMI, Polio, Diphtheria and Income composition of resources are significantly associated with life expectancy. Thinness 5-9 years is only marginally significant (p = 0.072).
For a given predictor variable, the coefficient (b) can be interpreted as the average effect on y of a one-unit increase in the predictor, holding all other predictors fixed.
Measles, Hepatitis B and under-five deaths are not significant in the multiple regression model, and neither are Alcohol, Total expenditure and Thinness 1-19 years. The reduced model below drops Measles, percentage expenditure, Hepatitis B and under-five deaths; note that percentage expenditure is itself significant in the table above, so dropping it is not supported by its p-value.
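Filtering the p-values reported in the coefficient table above is a quick way to list which predictors pass the 0.05 threshold. The sketch below copies those p-values into a named vector (the vector name `p_values` is illustrative); with the fitted model the same column is `summary(model)$coefficients[, "Pr(>|t|)"]`.

```r
# p-values copied from the coefficient table of the full model (intercept omitted)
p_values <- c(
  Alcohol                         = 3.482619e-01,
  percentage.expenditure          = 1.121298e-19,
  Hepatitis.B                     = 5.519163e-01,
  Measles                         = 1.919410e-01,
  BMI                             = 7.070793e-22,
  under.five.deaths               = 3.579069e-01,
  Polio                           = 1.482979e-06,
  Total.expenditure               = 7.715916e-01,
  Diphtheria                      = 1.856993e-04,
  thinness..1.19.years            = 5.587927e-01,
  thinness.5.9.years              = 7.239068e-02,
  Income.composition.of.resources = 1.130122e-118
)

# Predictors whose coefficient differs significantly from zero at the 5% level
significant     <- names(p_values)[p_values < 0.05]
not_significant <- names(p_values)[p_values >= 0.05]
significant
```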
library(tidyverse)
model <- lm(Life.expectancy ~ Alcohol + BMI + Polio+ Total.expenditure + Diphtheria + thinness..1.19.years + thinness.5.9.years + Income.composition.of.resources, data = life_expectancy_data_num)
summary(model)
##
## Call:
## lm(formula = Life.expectancy ~ Alcohol + BMI + Polio + Total.expenditure +
## Diphtheria + thinness..1.19.years + thinness.5.9.years +
## Income.composition.of.resources, data = life_expectancy_data_num)
##
## Residuals:
## Min 1Q Median 3Q Max
## -28.8980 -2.9211 0.1913 2.8549 24.8361
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 45.318748 0.677932 66.849 < 2e-16 ***
## Alcohol 0.033620 0.034374 0.978 0.328
## BMI 0.093951 0.007609 12.348 < 2e-16 ***
## Polio 0.046490 0.007006 6.635 3.94e-11 ***
## Total.expenditure 0.016001 0.052026 0.308 0.758
## Diphtheria 0.045517 0.006991 6.511 8.97e-11 ***
## thinness..1.19.years -0.114326 0.074353 -1.538 0.124
## thinness.5.9.years -0.094107 0.072858 -1.292 0.197
## Income.composition.of.resources 21.599486 0.727366 29.695 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.86 on 2547 degrees of freedom
## (382 observations deleted due to missingness)
## Multiple R-squared: 0.6124, Adjusted R-squared: 0.6112
## F-statistic: 503 on 8 and 2547 DF, p-value: < 2.2e-16
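Finally, our model can be written as follows:
Life.expectancy = 45.319 + 0.034*Alcohol + 0.094*BMI + 0.046*Polio + 0.016*Total.expenditure + 0.046*Diphtheria - 0.114*thinness..1.19.years - 0.094*thinness.5.9.years + 21.599*Income.composition.of.resources
The confidence intervals of the model coefficients can be extracted as follows: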
confint(model)
## 2.5 % 97.5 %
## (Intercept) 43.98939399 46.64810222
## Alcohol -0.03378338 0.10102314
## BMI 0.07903126 0.10887137
## Polio 0.03275133 0.06022859
## Total.expenditure -0.08601684 0.11801788
## Diphtheria 0.03180879 0.05922563
## thinness..1.19.years -0.26012391 0.03147274
## thinness.5.9.years -0.23697353 0.04875936
## Income.composition.of.resources 20.17319681 23.02577503
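Each interval returned by `confint()` is the estimate plus or minus a t critical value times the standard error. As a check, the sketch below reproduces the BMI interval from the rounded estimate and standard error printed in the reduced-model summary; the variable names are illustrative.

```r
# Values copied from the reduced-model summary shown earlier
estimate  <- 0.093951   # BMI coefficient
std_error <- 0.007609   # its standard error
df_resid  <- 2547       # residual degrees of freedom

# Two-sided 95% critical value of the t distribution
t_crit <- qt(0.975, df = df_resid)

ci <- c(lower = estimate - t_crit * std_error,
        upper = estimate + t_crit * std_error)
round(ci, 5)
```

The result agrees with the BMI row above up to rounding of the inputs. Intervals that contain zero (Alcohol, Total expenditure and both thinness variables) correspond to the coefficients that are not significant at the 5% level.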
As we have seen in simple linear regression, the overall quality of the model can be assessed by examining the R-squared (R2) and Residual Standard Error (RSE).
R-squared:
In multiple linear regression, R represents the correlation coefficient between the observed values of the outcome variable (y) and the fitted (i.e., predicted) values of y, and R2 is its square. For this reason, the value of R will always be non-negative and will range from zero to one.
R2 represents the proportion of variance, in the outcome variable y, that may be predicted by knowing the value of the x variables. An R2 value close to 1 indicates that the model explains a large portion of the variance in the outcome variable.
A problem with R2 is that it will always increase when more variables are added to the model, even if those variables are only weakly associated with the response (James et al. 2014). A solution is to adjust the R2 by taking into account the number of predictor variables.
The adjustment in the “Adjusted R Square” value in the summary output is a correction for the number of x variables included in the prediction model.
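The usual form of this correction is adjusted R2 = 1 - (1 - R2) * (n - 1) / (n - p - 1), where n is the number of observations used and p the number of predictors. The sketch below applies it to the figures printed for the reduced model, assuming n is the residual degrees of freedom plus the number of estimated coefficients.

```r
# Figures copied from the reduced-model summary (rounded as printed)
r_squared <- 0.6124
p         <- 8              # number of predictors
n         <- 2547 + p + 1   # residual df + predictors + intercept = 2556 rows used

# Adjusted R-squared penalises R-squared for the number of predictors
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / (n - p - 1)
round(adj_r_squared, 4)
```

With more than two thousand observations and only eight predictors the penalty is small, which is why the two values in the summary differ only in the third decimal place.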
Residual Standard Error (RSE), or sigma:
The RSE estimate gives a measure of error of prediction. The lower the RSE, the more accurate the model (on the data in hand).
The error rate can be estimated by dividing the RSE by the mean outcome variable:
sigma(model)/mean(life_expectancy_data$Life.expectancy)
## [1] 0.08465579